2T_Pandas로 배우는 SQL 시작하기 (2) - JOIN ON


In [1]:
import pymysql

db = pymysql.connect(
    "db.fastcamp.us",
    "root",             
    "dkstncks",
    "world",
    charset='utf8',
)
df = pd.read_sql("SELECT * FROM Country;", db)

In [2]:
#cursor

In [3]:
cursor = db.cursor()

In [5]:
# 1. 실제로 명령을 수행하는 부분 - 서버
cursor.execute("SELECT * FROM Country;")


Out[5]:
239

In [7]:
# 2. 데이터를 가져오는 부분 - 서버 => 클라이언트
cursor.fetchall()   #결과를 불러 온다. Pandas는 사실 이걸 읽는 것이다.


Out[7]:
()

In [8]:
pd.read_sql("SELECT * FROM Country;", db)


Out[8]:
Code Name Continent Region SurfaceArea IndepYear Population LifeExpectancy GNP GNPOld LocalName GovernmentForm HeadOfState Capital Code2
0 ABW Aruba North America Caribbean 193.0 NaN 103000 78.4 828.0 793.0 Aruba Nonmetropolitan Territory of The Netherlands Beatrix 129.0 AW
1 AFG Afghanistan Asia Southern and Central Asia 652090.0 1919.0 22720000 45.9 5976.0 NaN Afganistan/Afqanestan Islamic Emirate Mohammad Omar 1.0 AF
2 AGO Angola Africa Central Africa 1246700.0 1975.0 12878000 38.3 6648.0 7984.0 Angola Republic José Eduardo dos Santos 56.0 AO
3 AIA Anguilla North America Caribbean 96.0 NaN 8000 76.1 63.2 NaN Anguilla Dependent Territory of the UK Elisabeth II 62.0 AI
4 ALB Albania Europe Southern Europe 28748.0 1912.0 3401200 71.6 3205.0 2500.0 Shqipëria Republic Rexhep Mejdani 34.0 AL
5 AND Andorra Europe Southern Europe 468.0 1278.0 78000 83.5 1630.0 NaN Andorra Parliamentary Coprincipality 55.0 AD
6 ANT Netherlands Antilles North America Caribbean 800.0 NaN 217000 74.7 1941.0 NaN Nederlandse Antillen Nonmetropolitan Territory of The Netherlands Beatrix 33.0 AN
7 ARE United Arab Emirates Asia Middle East 83600.0 1971.0 2441000 74.1 37966.0 36846.0 Al-Imarat al-´Arabiya al-Muttahida Emirate Federation Zayid bin Sultan al-Nahayan 65.0 AE
8 ARG Argentina South America South America 2780400.0 1816.0 37032000 75.1 340238.0 323310.0 Argentina Federal Republic Fernando de la Rúa 69.0 AR
9 ARM Armenia Asia Middle East 29800.0 1991.0 3520000 66.4 1813.0 1627.0 Hajastan Republic Robert Kotšarjan 126.0 AM
10 ASM American Samoa Oceania Polynesia 199.0 NaN 68000 75.1 334.0 NaN Amerika Samoa US Territory George W. Bush 54.0 AS
11 ATA Antarctica Antarctica Antarctica 13120000.0 NaN 0 NaN 0.0 NaN Co-administrated NaN AQ
12 ATF French Southern territories Antarctica Antarctica 7780.0 NaN 0 NaN 0.0 NaN Terres australes françaises Nonmetropolitan Territory of France Jacques Chirac NaN TF
13 ATG Antigua and Barbuda North America Caribbean 442.0 1981.0 68000 70.5 612.0 584.0 Antigua and Barbuda Constitutional Monarchy Elisabeth II 63.0 AG
14 AUS Australia Oceania Australia and New Zealand 7741220.0 1901.0 18886000 79.8 351182.0 392911.0 Australia Constitutional Monarchy, Federation Elisabeth II 135.0 AU
15 AUT Austria Europe Western Europe 83859.0 1918.0 8091800 77.7 211860.0 206025.0 Österreich Federal Republic Thomas Klestil 1523.0 AT
16 AZE Azerbaijan Asia Middle East 86600.0 1991.0 7734000 62.9 4127.0 4100.0 Azärbaycan Federal Republic Heydär Äliyev 144.0 AZ
17 BDI Burundi Africa Eastern Africa 27834.0 1962.0 6695000 46.2 903.0 982.0 Burundi/Uburundi Republic Pierre Buyoya 552.0 BI
18 BEL Belgium Europe Western Europe 30518.0 1830.0 10239000 77.8 249704.0 243948.0 België/Belgique Constitutional Monarchy, Federation Albert II 179.0 BE
19 BEN Benin Africa Western Africa 112622.0 1960.0 6097000 50.2 2357.0 2141.0 Bénin Republic Mathieu Kérékou 187.0 BJ
20 BFA Burkina Faso Africa Western Africa 274000.0 1960.0 11937000 46.7 2425.0 2201.0 Burkina Faso Republic Blaise Compaoré 549.0 BF
21 BGD Bangladesh Asia Southern and Central Asia 143998.0 1971.0 129155000 60.2 32852.0 31966.0 Bangladesh Republic Shahabuddin Ahmad 150.0 BD
22 BGR Bulgaria Europe Eastern Europe 110994.0 1908.0 8190900 70.9 12178.0 10169.0 Balgarija Republic Petar Stojanov 539.0 BG
23 BHR Bahrain Asia Middle East 694.0 1971.0 617000 73.0 6366.0 6097.0 Al-Bahrayn Monarchy (Emirate) Hamad ibn Isa al-Khalifa 149.0 BH
24 BHS Bahamas North America Caribbean 13878.0 1973.0 307000 71.1 3527.0 3347.0 The Bahamas Constitutional Monarchy Elisabeth II 148.0 BS
25 BIH Bosnia and Herzegovina Europe Southern Europe 51197.0 1992.0 3972000 71.5 2841.0 NaN Bosna i Hercegovina Federal Republic Ante Jelavic 201.0 BA
26 BLR Belarus Europe Eastern Europe 207600.0 1991.0 10236000 68.0 13714.0 NaN Belarus Republic Aljaksandr Lukašenka 3520.0 BY
27 BLZ Belize North America Central America 22696.0 1981.0 241000 70.9 630.0 616.0 Belize Constitutional Monarchy Elisabeth II 185.0 BZ
28 BMU Bermuda North America North America 53.0 NaN 65000 76.9 2328.0 2190.0 Bermuda Dependent Territory of the UK Elisabeth II 191.0 BM
29 BOL Bolivia South America South America 1098581.0 1825.0 8329000 63.7 8571.0 7967.0 Bolivia Republic Hugo Bánzer Suárez 194.0 BO
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
209 TKL Tokelau Oceania Polynesia 12.0 NaN 2000 NaN 0.0 NaN Tokelau Nonmetropolitan Territory of New Zealand Elisabeth II 3333.0 TK
210 TKM Turkmenistan Asia Southern and Central Asia 488100.0 1991.0 4459000 60.9 4397.0 2000.0 Türkmenostan Republic Saparmurad Nijazov 3419.0 TM
211 TMP East Timor Asia Southeast Asia 14874.0 NaN 885000 46.0 0.0 NaN Timor Timur Administrated by the UN José Alexandre Gusmão 1522.0 TP
212 TON Tonga Oceania Polynesia 650.0 1970.0 99000 67.9 146.0 170.0 Tonga Monarchy Taufa'ahau Tupou IV 3334.0 TO
213 TTO Trinidad and Tobago North America Caribbean 5130.0 1962.0 1295000 68.0 6232.0 5867.0 Trinidad and Tobago Republic Arthur N. R. Robinson 3336.0 TT
214 TUN Tunisia Africa Northern Africa 163610.0 1956.0 9586000 73.7 20026.0 18898.0 Tunis/Tunisie Republic Zine al-Abidine Ben Ali 3349.0 TN
215 TUR Turkey Asia Middle East 774815.0 1923.0 66591000 71.0 210721.0 189122.0 Türkiye Republic Ahmet Necdet Sezer 3358.0 TR
216 TUV Tuvalu Oceania Polynesia 26.0 1978.0 12000 66.3 6.0 NaN Tuvalu Constitutional Monarchy Elisabeth II 3424.0 TV
217 TWN Taiwan Asia Eastern Asia 36188.0 1945.0 22256000 76.4 256254.0 263451.0 T’ai-wan Republic Chen Shui-bian 3263.0 TW
218 TZA Tanzania Africa Eastern Africa 883749.0 1961.0 33517000 52.3 8005.0 7388.0 Tanzania Republic Benjamin William Mkapa 3306.0 TZ
219 UGA Uganda Africa Eastern Africa 241038.0 1962.0 21778000 42.9 6313.0 6887.0 Uganda Republic Yoweri Museveni 3425.0 UG
220 UKR Ukraine Europe Eastern Europe 603700.0 1991.0 50456000 66.0 42168.0 49677.0 Ukrajina Republic Leonid Kutšma 3426.0 UA
221 UMI United States Minor Outlying Islands Oceania Micronesia/Caribbean 16.0 NaN 0 NaN 0.0 NaN United States Minor Outlying Islands Dependent Territory of the US George W. Bush NaN UM
222 URY Uruguay South America South America 175016.0 1828.0 3337000 75.2 20831.0 19967.0 Uruguay Republic Jorge Batlle Ibáñez 3492.0 UY
223 USA United States North America North America 9363520.0 1776.0 278357000 77.1 8510700.0 8110900.0 United States Federal Republic George W. Bush 3813.0 US
224 UZB Uzbekistan Asia Southern and Central Asia 447400.0 1991.0 24318000 63.7 14194.0 21300.0 Uzbekiston Republic Islam Karimov 3503.0 UZ
225 VAT Holy See (Vatican City State) Europe Southern Europe 0.4 1929.0 1000 NaN 9.0 NaN Santa Sede/Città del Vaticano Independent Church State Johannes Paavali II 3538.0 VA
226 VCT Saint Vincent and the Grenadines North America Caribbean 388.0 1979.0 114000 72.3 285.0 NaN Saint Vincent and the Grenadines Constitutional Monarchy Elisabeth II 3066.0 VC
227 VEN Venezuela South America South America 912050.0 1811.0 24170000 73.1 95023.0 88434.0 Venezuela Federal Republic Hugo Chávez Frías 3539.0 VE
228 VGB Virgin Islands, British North America Caribbean 151.0 NaN 21000 75.4 612.0 573.0 British Virgin Islands Dependent Territory of the UK Elisabeth II 537.0 VG
229 VIR Virgin Islands, U.S. North America Caribbean 347.0 NaN 93000 78.1 0.0 NaN Virgin Islands of the United States US Territory George W. Bush 4067.0 VI
230 VNM Vietnam Asia Southeast Asia 331689.0 1945.0 79832000 69.3 21929.0 22834.0 Viêt Nam Socialistic Republic Trân Duc Luong 3770.0 VN
231 VUT Vanuatu Oceania Melanesia 12189.0 1980.0 190000 60.6 261.0 246.0 Vanuatu Republic John Bani 3537.0 VU
232 WLF Wallis and Futuna Oceania Polynesia 200.0 NaN 15000 NaN 0.0 NaN Wallis-et-Futuna Nonmetropolitan Territory of France Jacques Chirac 3536.0 WF
233 WSM Samoa Oceania Polynesia 2831.0 1962.0 180000 69.2 141.0 157.0 Samoa Parlementary Monarchy Malietoa Tanumafili II 3169.0 WS
234 YEM Yemen Asia Middle East 527968.0 1918.0 18112000 59.8 6041.0 5729.0 Al-Yaman Republic Ali Abdallah Salih 1780.0 YE
235 YUG Yugoslavia Europe Southern Europe 102173.0 1918.0 10640000 72.4 17000.0 NaN Jugoslavija Federal Republic Vojislav Koštunica 1792.0 YU
236 ZAF South Africa Africa Southern Africa 1221037.0 1910.0 40377000 51.1 116729.0 129092.0 South Africa Republic Thabo Mbeki 716.0 ZA
237 ZMB Zambia Africa Eastern Africa 752618.0 1964.0 9169000 37.2 3377.0 3922.0 Zambia Republic Frederick Chiluba 3162.0 ZM
238 ZWE Zimbabwe Africa Eastern Africa 390757.0 1980.0 11669000 37.8 5951.0 8670.0 Zimbabwe Republic Robert G. Mugabe 4068.0 ZW

239 rows × 15 columns


In [ ]:

텍스트 마이닝

  • 특정 텍스트가 포함되어 있는 row를 가져오는 방법
  • GovernmentForm 에 "Republic"이라는 텍스트가 포함된 열 가져오기

1. pandas로

  • contains, startswith, endswith

In [10]:
country_df = pd.read_sql("SELECT * FROM Country;", db)

In [13]:
country_df[country_df["GovernmentForm"].str.contains("Republic")].head(2)


Out[13]:
Code Name Continent Region SurfaceArea IndepYear Population LifeExpectancy GNP GNPOld LocalName GovernmentForm HeadOfState Capital Code2
2 AGO Angola Africa Central Africa 1246700.0 1975.0 12878000 38.3 6648.0 7984.0 Angola Republic José Eduardo dos Santos 56.0 AO
4 ALB Albania Europe Southern Europe 28748.0 1912.0 3401200 71.6 3205.0 2500.0 Shqipëria Republic Rexhep Mejdani 34.0 AL

In [14]:
country_df[country_df["GovernmentForm"].str.startswith("Republic")].head(2)


Out[14]:
Code Name Continent Region SurfaceArea IndepYear Population LifeExpectancy GNP GNPOld LocalName GovernmentForm HeadOfState Capital Code2
2 AGO Angola Africa Central Africa 1246700.0 1975.0 12878000 38.3 6648.0 7984.0 Angola Republic José Eduardo dos Santos 56.0 AO
4 ALB Albania Europe Southern Europe 28748.0 1912.0 3401200 71.6 3205.0 2500.0 Shqipëria Republic Rexhep Mejdani 34.0 AL

In [15]:
country_df[country_df["GovernmentForm"].str.endswith("Republic")].head(2)


Out[15]:
Code Name Continent Region SurfaceArea IndepYear Population LifeExpectancy GNP GNPOld LocalName GovernmentForm HeadOfState Capital Code2
2 AGO Angola Africa Central Africa 1246700.0 1975.0 12878000 38.3 6648.0 7984.0 Angola Republic José Eduardo dos Santos 56.0 AO
4 ALB Albania Europe Southern Europe 28748.0 1912.0 3401200 71.6 3205.0 2500.0 Shqipëria Republic Rexhep Mejdani 34.0 AL

In [17]:
gf_str = country_df["GovernmentForm"].str

In [18]:
gf_str. #이렇게 해서 str에 어떤 기능들이 있는 지 확인할 수 있다.


  File "<ipython-input-18-88646dcfd66a>", line 1
    gf_str. #이렇게 해서 str에 어떤 기능들이 있는 지 확인할 수 있다.
                                               ^
SyntaxError: invalid syntax

In [19]:
# SQL을 이용

SQL_QUERY = """
    SELECT Name, GovernmentForm
    From Country
    WHERE
        GovernmentForm LIKE "Republic"
    ;
"""

pd.read_sql(SQL_QUERY, db).head()


Out[19]:
Name GovernmentForm
0 Angola Republic
1 Albania Republic
2 Armenia Republic
3 Burundi Republic
4 Benin Republic

In [22]:
SQL_QUERY = """
    SELECT Name, GovernmentForm
    From Country
    WHERE
        GovernmentForm LIKE "%Republic%"
    ;
"""

pd.read_sql(SQL_QUERY, db)


Out[22]:
Name GovernmentForm
0 Angola Republic
1 Albania Republic
2 Argentina Federal Republic
3 Armenia Republic
4 Austria Federal Republic
5 Azerbaijan Federal Republic
6 Burundi Republic
7 Benin Republic
8 Burkina Faso Republic
9 Bangladesh Republic
10 Bulgaria Republic
11 Bosnia and Herzegovina Federal Republic
12 Belarus Republic
13 Bolivia Republic
14 Brazil Federal Republic
15 Botswana Republic
16 Central African Republic Republic
17 Chile Republic
18 China People'sRepublic
19 Côte d’Ivoire Republic
20 Cameroon Republic
21 Congo, The Democratic Republic of the Republic
22 Congo Republic
23 Colombia Republic
24 Comoros Republic
25 Cape Verde Republic
26 Costa Rica Republic
27 Cuba Socialistic Republic
28 Cyprus Republic
29 Czech Republic Republic
... ... ...
113 San Marino Republic
114 Somalia Republic
115 Sao Tome and Principe Republic
116 Suriname Republic
117 Slovakia Republic
118 Slovenia Republic
119 Seychelles Republic
120 Syria Republic
121 Chad Republic
122 Togo Republic
123 Tajikistan Republic
124 Turkmenistan Republic
125 Trinidad and Tobago Republic
126 Tunisia Republic
127 Turkey Republic
128 Taiwan Republic
129 Tanzania Republic
130 Uganda Republic
131 Ukraine Republic
132 Uruguay Republic
133 United States Federal Republic
134 Uzbekistan Republic
135 Venezuela Federal Republic
136 Vietnam Socialistic Republic
137 Vanuatu Republic
138 Yemen Republic
139 Yugoslavia Federal Republic
140 South Africa Republic
141 Zambia Republic
142 Zimbabwe Republic

143 rows × 2 columns

JOIN(pandas에서는 merge 기능)

  • "sakila"로 들어와서 실습

In [24]:
db = pymysql.connect(
    "db.fastcamp.us",
    "root",
    "dkstncks",
    "sakila",
    charset = "utf8",
)
  • address(주소), customer(이름)
  • pandas merge를 이용해서 유저의 이름과 주소가 같이 있는 DataFrame을 만들어보기

In [26]:
customer_df = pd.read_sql("SELECT * FROM customer;", db)

In [28]:
address_df = pd.read_sql("SELECT * FROM address;", db)

In [29]:
customer_df.columns


Out[29]:
Index(['customer_id', 'store_id', 'first_name', 'last_name', 'email',
       'address_id', 'active', 'create_date', 'last_update'],
      dtype='object')

In [30]:
address_df.columns


Out[30]:
Index(['address_id', 'address', 'address2', 'district', 'city_id',
       'postal_code', 'phone', 'last_update'],
      dtype='object')

In [32]:
customer_df.merge(address_df, on="address_id")[["first_name", "last_name", "address"]]
# left_on="address_id"
# right_on="address_id"
# => on...


Out[32]:
first_name last_name address
0 MARY SMITH 1913 Hanoi Way
1 PATRICIA JOHNSON 1121 Loja Avenue
2 LINDA WILLIAMS 692 Joliet Street
3 BARBARA JONES 1566 Inegl Manor
4 ELIZABETH BROWN 53 Idfu Parkway
5 JENNIFER DAVIS 1795 Santiago de Compostela Way
6 MARIA MILLER 900 Santiago de Compostela Parkway
7 SUSAN WILSON 478 Joliet Way
8 MARGARET MOORE 613 Korolev Drive
9 DOROTHY TAYLOR 1531 Sal Drive
10 LISA ANDERSON 1542 Tarlac Parkway
11 NANCY THOMAS 808 Bhopal Manor
12 KAREN JACKSON 270 Amroha Parkway
13 BETTY WHITE 770 Bydgoszcz Avenue
14 HELEN HARRIS 419 Iligan Lane
15 SANDRA MARTIN 360 Toulouse Parkway
16 DONNA THOMPSON 270 Toulon Boulevard
17 CAROL GARCIA 320 Brest Avenue
18 RUTH MARTINEZ 1417 Lancaster Avenue
19 SHARON ROBINSON 1688 Okara Way
20 MICHELLE CLARK 262 A Corua (La Corua) Parkway
21 LAURA RODRIGUEZ 28 Charlotte Amalie Street
22 SARAH LEWIS 1780 Hino Boulevard
23 KIMBERLY LEE 96 Tafuna Way
24 DEBORAH WALKER 934 San Felipe de Puerto Plata Street
25 JESSICA HALL 18 Duisburg Boulevard
26 SHIRLEY ALLEN 217 Botshabelo Place
27 CYNTHIA YOUNG 1425 Shikarpur Manor
28 ANGELA HERNANDEZ 786 Aurora Avenue
29 MELISSA KING 1668 Anpolis Street
... ... ... ...
569 IVAN CROMWELL 1351 Sousse Lane
570 JOHNNIE CHISHOLM 1501 Pangkal Pinang Avenue
571 SIDNEY BURLESON 1405 Hagonoy Avenue
572 BYRON BOX 521 San Juan Bautista Tuxtepec Place
573 JULIAN VEST 923 Tangail Boulevard
574 ISAAC OGLESBY 186 Skikda Lane
575 MORRIS MCCARTER 1568 Celaya Parkway
576 CLIFTON MALCOLM 1489 Kakamigahara Lane
577 WILLARD LUMPKIN 1819 Alessandria Loop
578 DARYL LARUE 1208 Tama Loop
579 ROSS GREY 951 Springs Lane
580 VIRGIL WOFFORD 760 Miyakonojo Drive
581 ANDY VANHORN 966 Asuncin Way
582 MARSHALL THORN 1584 Ljubertsy Lane
583 SALVADOR TEEL 247 Jining Parkway
584 PERRY SWAFFORD 773 Dallas Manor
585 KIRK STCLAIR 1923 Stara Zagora Lane
586 SERGIO STANFIELD 1402 Zanzibar Boulevard
587 MARION OCAMPO 1464 Kursk Parkway
588 TRACY HERRMANN 1074 Sanaa Parkway
589 SETH HANNON 1759 Niznekamsk Avenue
590 KENT ARSENAULT 32 Liaocheng Way
591 TERRANCE ROUSH 42 Fontana Avenue
592 RENE MCALISTER 1895 Zhezqazghan Drive
593 EDUARDO HIATT 1837 Kaduna Parkway
594 TERRENCE GUNDERSON 844 Bucuresti Place
595 ENRIQUE FORSYTHE 1101 Bucuresti Boulevard
596 FREDDIE DUGGAN 1103 Quilmes Boulevard
597 WADE DELVALLE 1331 Usak Boulevard
598 AUSTIN CINTRON 1325 Fukuyama Street

599 rows × 3 columns

  • sql로 하는 첫번째 방법 => 바보 같은 방법이지만 중요하다. 원리를 이해해라

In [34]:
SQL_QUERY = """
    SELECT COUNT(*)
    FROM customer
    ;
"""

pd.read_sql(SQL_QUERY, db).head()


Out[34]:
COUNT(*)
0 599

In [35]:
SQL_QUERY = """
    SELECT COUNT(*)
    FROM address
    ;
"""

pd.read_sql(SQL_QUERY, db).head()


Out[35]:
COUNT(*)
0 603
  • customer 599개 address는 603개가 있는 상황

In [37]:
SQL_QUERY = """
    SELECT COUNT(*)
    FROM customer, address
    ;
"""

pd.read_sql(SQL_QUERY, db).head()


Out[37]:
COUNT(*)
0 361197
  • 599 * 603 = 361197
  • 데이터를 받아오는 데 오랜 시간이 걸렸다.

In [40]:
SQL_QUERY = """
    SELECT customer.first_name, customer.last_name, address.address
    FROM customer, address
    WHERE
        customer.address_id = address.address_id
    ;
"""

df = pd.read_sql(SQL_QUERY, db).head()
  • 즉, 데이터 테이블을 합친 다음에 ( 모든 row에 대한 곱 )
  • WHERE 조건문을 통해서 우리가 원하는 데이터만 가져오자.

In [43]:
SQL_QUERY = """
    SELECT customer.first_name, customer.last_name, address.address
    FROM customer
    JOIN address ON customer.address_id = address.address_id
    ;
"""

pd.read_sql(SQL_QUERY, db).head()


Out[43]:
first_name last_name address
0 MARY SMITH 1913 Hanoi Way
1 PATRICIA JOHNSON 1121 Loja Avenue
2 LINDA WILLIAMS 692 Joliet Street
3 BARBARA JONES 1566 Inegl Manor
4 ELIZABETH BROWN 53 Idfu Parkway

In [52]:
import time
start_time = time.time()

customer_df = pd.read_sql("SELECT * FROM customer;", db)
address_df = pd.read_sql("SELECT * FROM address;", db)
df = customer_df.merge(address_df, on="address_id")

end_time = time.time()
exec_time = end_time - start_time
print(exec_time)


0.26601505279541016

In [64]:
start_time = time.time()

SQL_QUERY = """
    SELECT customer.first_name, customer.last_name, address.address
    FROM customer, address
    WHERE
        customer.address_id = address.address_id
    ;
"""
df = pd.read_sql(SQL_QUERY, db)

end_time = time.time()
exec_time = end_time - start_time
print(exec_time)


0.05300307273864746

In [63]:
start_time = time.time()

SQL_QUERY = """
    SELECT customer.first_name, customer.last_name, address.address
    FROM customer
    JOIN address ON customer.address_id = address.address_id
    ;
"""
pd.read_sql(SQL_QUERY, db)

end_time = time.time()
exec_time = end_time - start_time
print(exec_time)


0.04200243949890137
  • pandas보다 sql에서 JOIN을 활용
  • sql 1번과 2번 중 무엇을 써야 하는가? JOIN!! JOIN을 쓰면 코드가 한 눈에 안 들어온다.
  • SQL (1) - SELECT __ WHERE
  • SQL (2) - SELECT FROM JOIN __
  • 굳이 JOIN을 안 쓰더라도 상관 없어. WHERE문만 써도 괜찮아

World DB에서 Country와 City를 합쳐라. (Country Name, City Name)

3가지 방법으로. SQL(where, join)과 Pandas(merge)


In [65]:
db = pymysql.connect(
    "db.fastcamp.us",
    "root",             
    "dkstncks",
    "world",
    charset='utf8',
)
country_df = pd.read_sql("SELECT * FROM Country;", db)
city_df = pd.read_sql("SELECT * FROM City;", db)

(1) Pandas_merge


In [66]:
country_df.columns


Out[66]:
Index(['Code', 'Name', 'Continent', 'Region', 'SurfaceArea', 'IndepYear',
       'Population', 'LifeExpectancy', 'GNP', 'GNPOld', 'LocalName',
       'GovernmentForm', 'HeadOfState', 'Capital', 'Code2'],
      dtype='object')

In [67]:
city_df.columns


Out[67]:
Index(['ID', 'Name', 'CountryCode', 'District', 'Population'], dtype='object')

In [73]:
city_df.merge(country_df, right_on="Code", left_on="CountryCode")[["Name_x", "Name_y"]]


Out[73]:
Name_x Name_y
0 Kabul Afghanistan
1 Qandahar Afghanistan
2 Herat Afghanistan
3 Mazar-e-Sharif Afghanistan
4 Amsterdam Netherlands
5 Rotterdam Netherlands
6 Haag Netherlands
7 Utrecht Netherlands
8 Eindhoven Netherlands
9 Tilburg Netherlands
10 Groningen Netherlands
11 Breda Netherlands
12 Apeldoorn Netherlands
13 Nijmegen Netherlands
14 Enschede Netherlands
15 Haarlem Netherlands
16 Almere Netherlands
17 Arnhem Netherlands
18 Zaanstad Netherlands
19 ´s-Hertogenbosch Netherlands
20 Amersfoort Netherlands
21 Maastricht Netherlands
22 Dordrecht Netherlands
23 Leiden Netherlands
24 Haarlemmermeer Netherlands
25 Zoetermeer Netherlands
26 Emmen Netherlands
27 Zwolle Netherlands
28 Ede Netherlands
29 Delft Netherlands
... ... ...
4049 Roanoke United States
4050 Billings United States
4051 Compton United States
4052 Gainesville United States
4053 Fairfield United States
4054 Arden-Arcade United States
4055 San Mateo United States
4056 Visalia United States
4057 Boulder United States
4058 Cary United States
4059 Santa Monica United States
4060 Fall River United States
4061 Kenosha United States
4062 Elgin United States
4063 Odessa United States
4064 Carson United States
4065 Charleston United States
4066 Charlotte Amalie Virgin Islands, U.S.
4067 Harare Zimbabwe
4068 Bulawayo Zimbabwe
4069 Chitungwiza Zimbabwe
4070 Mount Darwin Zimbabwe
4071 Mutare Zimbabwe
4072 Gweru Zimbabwe
4073 Gaza Palestine
4074 Khan Yunis Palestine
4075 Hebron Palestine
4076 Jabaliya Palestine
4077 Nablus Palestine
4078 Rafah Palestine

4079 rows × 2 columns

1. 바보 같지만 원리가 중요한 방법(sql)


In [77]:
SQL_QUERY = """
    SELECT Country.Name "Country Name", City.Name "City Name"
    FROM Country, City
    WHERE Country.Code = City.CountryCode
    ;
"""

pd.read_sql(SQL_QUERY, db).head(2)


Out[77]:
Country Name City Name
0 Aruba Oranjestad
1 Afghanistan Kabul

In [78]:
SQL_QUERY = """
    SELECT co.Name "Country Name", ci.Name "City Name"
    FROM Country co, City ci
    WHERE co.Code = ci.CountryCode
    ;
"""

pd.read_sql(SQL_QUERY, db).head(2)


Out[78]:
Country Name City Name
0 Aruba Oranjestad
1 Afghanistan Kabul

2. JOIN을 이용한 방법(DB를 생각해주는 착한 방법)


In [80]:
SQL_QUERY = """
    SELECT co.Name "Country Name", ci.Name "City Name"
    FROM Country co
        JOIN City ci 
        ON co.code = ci.CountryCode
    ;
"""

pd.read_sql(SQL_QUERY, db).head(2)


Out[80]:
Country Name City Name
0 Aruba Oranjestad
1 Afghanistan Kabul